Abstract

This paper proposes a novel learning method for a mixture of recurrent neural network (RNN) 
experts model, which can acquire the ability to generate desired sequences by dynamically switching
between experts. Our method is based on maximum likelihood estimation, using a gradient
descent algorithm. This approach is similar to that used in conventional methods; however, we 
modify the likelihood function by adding a mechanism to alter the variance for each expert. The 
proposed method is demonstrated to successfully learn Markov chain switching among a set of 9 
Lissajous curves, for which the conventional method fails. The learning performance, analyzed 
in terms of the generalization capability, of the proposed method is also shown to be superior to 
that of the conventional method. With the addition of a gating network, the proposed method 
is successfully applied to the learning of sensory-motor flows for a small humanoid robot as a 
realistic problem of time series prediction and generation. 



1 Introduction 



How to learn complex temporal sequence patterns by means of neural network models has been a 
challenging problem. Recurrent neural networks (RNNs) [11] [21] [27] have been one of the most
popular models applied to temporal sequence learning. RNNs can learn sensory-motor sequence 
patterns jTTl [T^ , symbolic sequences with grammar [51 and continuous spatio-temporal patterns 

m- 

In spite of the considerable amount of RNN research carried out since the mid 1980s, it has been 
thought that RNNs cannot be scaled to be capable of learning complex sequence patterns, especially 
when the sequence patterns to be learnt contain long-term dependency [1]. This is due to the fact
that the error signal cannot be propagated effectively in long-time windows of sequences using the
back-propagation through time (BPTT) algorithm [22], because of the potential nonlinearity of the
RNN dynamics [1].

There have been some breakthroughs in approaches to this problem. Hochreiter and Schmidhuber 
[51123] proposed a method called "Long Short-Term Memory" (LSTM). LSTM learns to bridge minimal 
time lags in long time steps by enforcing constant error flow through "constant error carousels" within
special units. These units learn to open and close access to the constant error flow. Echo state networks
[21 llOj and a very similar approach, liquid state machines "18], have recently attracted significant 
attention. An echo state network possesses a large pool of hidden neurons with fixed random weights,
and the pool is capable of rich dynamics. Jaeger [10] demonstrated that an echo state network can
successfully learn the Mackey-Glass chaotic time series which is a well-known benchmark system for 
time series prediction. 

Tani and Nolfi [26] investigated the same problems, but from a different angle, focusing on the
idea of compositionality for sensory-motor learning. The term compositionality was adopted from 
the "Principle of Compositionality" in linguistics, which claims that the meaning of a complex 
expression is determined by the meanings of its constituent expressions and the rules used to combine 
them. This principle, when translated to the sensory-motor learning context, leads to the assertion that 
diverse and complex patterns can be learned to be generated by adaptively combining a set of re-usable 
behavior primitives. Here, acquiring behavior primitives requires a mechanism for autonomously 
segmenting a continuously experienced sensory-motor flow into reusable chunks. Tani and Nolfi (1998) 
proposed a scheme for hierarchical segmentation of the sensory-motor flow, applying the idea of a 
mixture of experts [S] [131 HI] to hierarchically organized RNNs. Consider a network consisting of 
multiple local RNNs, organized in levels. At the base level, each RNN competitively learns to be 
an expert at predicting/generating a specific sensory-motor profile. As the winner among the expert 
modules changes, corresponding to structural changes in the sensory-motor flow, it is considered that 
the sensory-motor flow is segmented, by switching between winners. Meanwhile, the RNN experts 
at the higher level learn the winner switching sequence patterns with much slower time constants 
compared to those at the sensory-motor level. The higher level learns not the details of sensory-motor
patterns but an abstraction of primitive sequences with long-term dependency.

Although the scheme of the mixture of RNN experts seems to be tractable, in reality there are 
potential problems arising from scaling [55]. It is known that if the number of modules increases, 
segmenting the sequence becomes unstable and learning by a gating mechanism tends to fail. This 
is due to near-miss problems in matching the current sensory-motor pattern to the best local RNN 
among others, which have acquired similar pattern profiles. Since the number of modules determines 
the scalability of the system, it is desirable that learning proceeds stably with larger numbers of 
modules. 

In the present study, a novel learning method is presented for a mixture of RNN experts model. 
The proposed learning method allows a sequence to be segmented into reusable blocks without loss of 
stability with increasing number of learning modules. A maximum likelihood estimation based on the 
gradient descent algorithm is employed, similar to the conventional learning method for mixture of 
RNN experts [26] [6], although the present likelihood function differs from those presented previously.
In the proposed method, the weighting factor for the allocation of blocks to modules is changed 
adaptively via a variance parameter for each module. The superior performance of the proposed 
method is demonstrated through application to a problem that involves learning to extract a set 
of reusable primitives from data constructed by stochastically combining multiple Lissajous curve 
patterns. The proposed learning method is compared with the conventional method on the basis of 
generalization capability as a measure of the extent to which compositionality can be achieved in 
modular networks. Finally, the proposed method is applied in combination with a gating network to 
learn the sensory-motor flows for a small humanoid robot as a realistic problem. 



2 Model 

In this section, we define a model known as a mixture of recurrent neural network (RNN) experts 
model. A mixture of RNN experts is simply a mixture of experts model for which the learning modules 
are RNNs. The model equations are defined as follows:

$$c_n^{(i)} = \tanh\bigl(u_n^{(i)}\bigr), \qquad (2)$$

$$y_n^{(i)} = \tanh\bigl(W_3^{(i)} c_n^{(i)} + v_2^{(i)}\bigr), \qquad (3)$$

$$y_n = \sum_{i=1}^{N} g_n^{(i)}\, y_n^{(i)}, \qquad (4)$$

where $N$ is the number of expert modules, $\epsilon$ is a time constant satisfying $0 < \epsilon < 1$, and $x_n$ and $y_n$
are the input and output vectors of the model at time $n$, respectively. For each $i$, $u_n^{(i)}$, $c_n^{(i)}$ and $y_n^{(i)}$
denote the internal potentials of the context neurons, the states of the context neurons and the states of
the output neurons of module $i$, respectively. The matrices $W_1^{(i)}, W_2^{(i)}, W_3^{(i)}$ and vectors $v_1^{(i)}, v_2^{(i)}$
are the parameters of module $i$. The gate opening value for module $i$ is denoted by $g_n^{(i)}$, and it is
assumed that $g_n^{(i)} \ge 0$ and $\sum_{i=1}^{N} g_n^{(i)} = 1$. The gate opening vector $g_n$ represents the winner-take-all
competition among modules to determine the output $y_n$. The determination of the gate opening vector
$g_n$ is explained in section 2.1.
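As a rough illustration of equations (2) to (4), the following Python sketch computes one time step of a single expert and then mixes the expert outputs with given gate opening values. The leaky-integrator update of the internal potential in `expert_step` is an assumed form chosen only to be consistent with the time constant $\epsilon$ described above; the function names and the list-based data layout are hypothetical and are not taken from the paper.

```python
import math

def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector (list)."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def expert_step(u, x, params, eps):
    """One time step of a single expert.

    ASSUMPTION: the update of the internal potential u is a leaky
    integrator driven by the previous context state and the input.
    """
    W1, W2, W3, v1, v2 = params
    c_prev = [math.tanh(ui) for ui in u]
    drive = [a + b + c for a, b, c in zip(matvec(W1, c_prev), matvec(W2, x), v1)]
    u_new = [(1.0 - eps) * ui + eps * di for ui, di in zip(u, drive)]
    c = [math.tanh(ui) for ui in u_new]                        # equation (2)
    y = [math.tanh(a + b) for a, b in zip(matvec(W3, c), v2)]  # equation (3)
    return u_new, y

def mixture_output(gates, expert_outputs):
    """Gate-weighted sum of the expert outputs, equation (4)."""
    d = len(expert_outputs[0])
    return [sum(g * y[k] for g, y in zip(gates, expert_outputs)) for k in range(d)]
```

With gate values close to a one-hot vector, `mixture_output` simply returns the output of the winning expert, which is the winner-take-all behavior described in the text.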



2.1 Learning method 

The proposed learning method is defined on the basis of the probability distribution of the mixture of 
RNN experts. Here the parameters of learning module $i$ are denoted by $\vartheta_i = (W_1^{(i)}, W_2^{(i)}, W_3^{(i)}, v_1^{(i)}, v_2^{(i)}, u_0^{(i)})$,

and a gate opening vector $g_n$ is described by a variable $\beta_n$ such as

Let $X = (x_n)_{n=1}^{T}$ be an input sequence, and let $\beta_n$ and $\gamma$ be parameters, where $\gamma$ is a set of parameters
given by $\gamma_i = (\vartheta_i, \sigma_i)$. Given $X$, $\beta_n$ and $\gamma$, the probability density function (p.d.f.) of the output $y_n$ is
defined by

$$p(y_n \mid X, \beta_n, \gamma) = \sum_{i=1}^{N} g_n^{(i)}\, p(y_n \mid X, \gamma_i), \qquad (6)$$



where $p(y_n \mid X, \gamma_i)$ is given by



$d$ is the output dimension, and $y_n^{(i)}$ is the output of module $i$ computed by equations (1), (2) and
(3) with parameter $\vartheta_i$. Thus, the output of the model is governed by a mixture of normal distributions.
This equation results from the assumption that the observable sequence data are embedded in additive
Gaussian noise. It is well known that minimizing the mean square error is equivalent to maximizing
the likelihood determined by a normal distribution for learning in a single neural network. Therefore,
equation (7) is a natural extension of a neural network for learning. The details of the derivation of
equation (7) are given in [8].
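The mixture density of equation (6) can be sketched in Python as follows. Two points are assumptions made for this illustration rather than statements from the text: the gate opening values are obtained from $\beta_n$ by a softmax, and each expert density is an isotropic normal distribution with standard deviation $\sigma_i$ centered on the expert output. All function names are hypothetical.

```python
import math

def softmax(beta):
    """ASSUMPTION: gate opening values are a softmax of beta."""
    m = max(beta)
    exps = [math.exp(b - m) for b in beta]
    total = sum(exps)
    return [e / total for e in exps]

def expert_density(y, y_i, sigma_i):
    """ASSUMPTION: isotropic normal density centered on the expert output y_i."""
    d = len(y)
    sq_err = sum((a - b) ** 2 for a, b in zip(y, y_i))
    norm = (2.0 * math.pi * sigma_i ** 2) ** (-d / 2.0)
    return norm * math.exp(-sq_err / (2.0 * sigma_i ** 2))

def mixture_density(y, expert_outputs, sigmas, beta):
    """Gate-weighted sum of the expert densities, equation (6)."""
    gates = softmax(beta)
    return sum(g * expert_density(y, y_i, s)
               for g, y_i, s in zip(gates, expert_outputs, sigmas))
```

Under these assumptions, an expert with a small $\sigma_i$ contributes a sharply peaked density, so it dominates the mixture only when its prediction is close to the observed output.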

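Given a parameter set $\theta = ((\beta_n)_{n=1}^{T}, \gamma)$ and an input sequence $X$, the probability of an output
sequence $Y = (y_n)_{n=1}^{T}$ is given by

$$p(Y \mid X, \theta) = \prod_{n=1}^{T} p(y_n \mid X, \beta_n, \gamma). \qquad (8)$$

The likelihood function $L$ of the data set $D = (X, Y)$, parametrized by $\theta$, is denoted by

$$L(D, \theta) = p(Y \mid X, \theta)\, \phi(\theta), \qquad (9)$$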
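where $\phi(\theta)$ is the p.d.f. of the prior distribution given by

$$\phi(\theta) = \prod_{n=1}^{T-1} \prod_{i=1}^{N} \frac{1}{\sqrt{2\pi}\,\varsigma} \exp\!\left( -\frac{\bigl(\beta_{n+1}^{(i)} - \beta_n^{(i)}\bigr)^2}{2\varsigma^2} \right), \qquad (10)$$

with $\varsigma$ the standard deviation of the prior distribution.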



This equation means that the vector $\beta_n$ is governed by $N$-dimensional Brownian motion. The prior
distribution has the effect of suppressing the change of gate opening values. 
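To make the smoothing effect of the prior concrete, the sketch below evaluates the logarithm of a Brownian-motion prior on $\beta_n$ and its gradient. It assumes independent Gaussian increments with a common standard deviation `s`, which is one reading of the description above; the names `log_prior` and `log_prior_grad` are hypothetical.

```python
import math

def log_prior(beta, s):
    """Log density of Gaussian increments beta[n+1][i] - beta[n][i].

    ASSUMPTION: all increments are independent with standard deviation s.
    """
    total = 0.0
    for n in range(len(beta) - 1):
        for b_next, b in zip(beta[n + 1], beta[n]):
            total += -0.5 * math.log(2.0 * math.pi * s * s)
            total += -((b_next - b) ** 2) / (2.0 * s * s)
    return total

def log_prior_grad(beta, s):
    """Gradient of log_prior with respect to every beta[n][i]."""
    T, N = len(beta), len(beta[0])
    grad = [[0.0] * N for _ in range(T)]
    for n in range(T):
        for i in range(N):
            if n > 0:        # pull toward the previous time step
                grad[n][i] += (beta[n - 1][i] - beta[n][i]) / s ** 2
            if n < T - 1:    # pull toward the next time step
                grad[n][i] += (beta[n + 1][i] - beta[n][i]) / s ** 2
    return grad
```

The gradient has one term at the two ends of the sequence and two terms in the interior, and every term pulls $\beta_n$ toward its temporal neighbors with a strength proportional to $1/s^2$.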

The learning method proposed here is to choose the best parameter by maximizing (or integrating 
over) the likelihood $L(D, \theta)$ with training data $D$. Concretely, we use the gradient descent method
with a momentum term as the training procedure. The update rule for the model parameter is 



$$\Delta\theta(t) = \alpha \frac{\partial \ln L(D, \theta)}{\partial \theta} + \eta\, \Delta\theta(t-1), \qquad (11)$$

where $\theta(t)$ is the parameter at learning step $t$, $\alpha$ is the learning rate, and $\eta$ is the momentum term
parameter. For each parameter, the partial derivative $\partial \ln L(D, \theta) / \partial \theta$ is given by



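$$\frac{\partial \ln L(D, \theta)}{\partial \beta_n^{(i)}} = \frac{g_n^{(i)} \bigl( p(y_n \mid X, \gamma_i) - p(y_n \mid X, \beta_n, \gamma) \bigr)}{p(y_n \mid X, \beta_n, \gamma)} + \frac{1}{\varsigma^2} \begin{cases} \beta_{n+1}^{(i)} - \beta_n^{(i)} & \text{if } n = 1, \\ \beta_{n-1}^{(i)} - \beta_n^{(i)} & \text{if } n = T, \\ \beta_{n+1}^{(i)} - 2\beta_n^{(i)} + \beta_{n-1}^{(i)} & \text{otherwise,} \end{cases}$$

where $\varsigma$ is the standard deviation of the prior distribution. Notice that the derivatives with respect to
the module parameters $\vartheta_i$ can be computed by the back-propagation through time (BPTT) method
[22]. In this paper, we assume that the infimum $\underline{\sigma}$ of the parameter $\sigma_i$ is greater than 0, because if $\sigma_i$
converges to 0, then $\Delta\theta(t)$ diverges.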
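The momentum update of equation (11) can be sketched as a small Python helper. This is a minimal illustration: the gradient of the log-likelihood is passed in by the caller, and the step that adds $\Delta\theta(t)$ to the previous parameter value is an assumption about how the increment is applied. The name `momentum_step` is hypothetical.

```python
def momentum_step(theta, grad, delta_prev, alpha, eta):
    """One update with learning rate alpha and momentum eta.

    grad is the gradient of ln L with respect to theta, so the parameters
    move uphill on the likelihood.
    ASSUMPTION: the new parameter is theta + delta.
    """
    delta = [alpha * g + eta * d for g, d in zip(grad, delta_prev)]
    theta_new = [t + d for t, d in zip(theta, delta)]
    return theta_new, delta

if __name__ == "__main__":
    # Toy objective ln L = -(theta - 3)^2, maximized at theta = 3.
    theta, delta = [0.0], [0.0]
    for _ in range(500):
        grad = [-2.0 * (theta[0] - 3.0)]
        theta, delta = momentum_step(theta, grad, delta, alpha=0.1, eta=0.9)
    print(theta[0])
```

The toy objective is only there to show the call pattern; in the setting of the paper, `grad` would hold the derivatives with respect to $\beta_n$, $\vartheta_i$ and $\sigma_i$.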

The explanation above refers to a training data set D consisting of a single sequence. However, 
the method can be readily extended to the learning of several sequences by calculating the sum of 
gradients for each sequence. When several sequences are used as training data, the initial states $u_0^{(i)}$
and the parameters $\beta_n$ of the gate opening vector must be provided for each sequence.

An important difference between the proposed method and the conventional method used in [26] [6]
is the use of an optimized variance $\sigma$ of the normal distribution. Indeed, if $\sigma$ is a constant, e.g., $\sigma = 1$,
then the proposed method is equivalent to the conventional one. Adapting the variance $\sigma$ results in
different learning speeds for different learning modules, because the learning speed of a module $i$
depends on $\sigma_i$, from equation (15). Moreover, from equation (16), $\sigma_i$ decreases if the mean square
error of module $i$ is less than $d\sigma_i^2$. This fact implies that there exists positive feedback in the
learning speed of modules. When the same sequence blocks are allocated to a set of modules, the
learning speed of the best matching module becomes faster than for the other modules in the set due
to this feedback mechanism, reinforcing the matching between the module and the allocated blocks.
Hence, this feedback leads to a decision on the winner of the competition among modules. It is shown
later that the adaptive optimization of $\sigma$ plays an important role in segmenting the sequence $D$ into
blocks via the gate opening vector $g_n$.
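The threshold $d\sigma_i^2$ in the paragraph above can be checked with a short calculation. The sketch assumes an isotropic normal density with standard deviation `sigma` in `d` dimensions for a single expert and a single time step (an assumption for illustration, ignoring the gate weighting); the function names are hypothetical.

```python
import math

def log_density(sq_err, sigma, d):
    """Log of an isotropic normal density, given the squared error."""
    return -0.5 * d * math.log(2.0 * math.pi * sigma ** 2) - sq_err / (2.0 * sigma ** 2)

def dlog_density_dsigma(sq_err, sigma, d):
    """Derivative of log_density with respect to sigma.

    It is negative exactly when sq_err < d * sigma**2, so gradient ascent
    on the likelihood shrinks sigma for a well-matching module.
    """
    return (sq_err - d * sigma ** 2) / sigma ** 3

def error_gradient_scale(sigma):
    """Factor multiplying the output error in the gradient with respect to
    the expert output; a smaller sigma gives a larger effective learning speed."""
    return 1.0 / sigma ** 2
```

Taken together, the two functions show the feedback loop: a module whose error falls below the threshold lowers its $\sigma_i$, which increases `error_gradient_scale` and therefore the speed at which that module keeps learning.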
Figure 1: (a) Training data generated by Markov chain switching of 9 Lissajous curves. (b) Each
Lissajous curve. The subscript of each figure denotes the index of each Lissajous curve. The transitions
among curves are consonant with continuity of the orbit.



2.2 Feedback loop with time delay 

Let us consider the case in which there exists a feedback loop from output to input with a time delay
$\tau$, that is to say, the model is an autonomous system. In this case, a training data set $D = (X, Y)$
has to satisfy $y_n = x_{n+\tau}$. At the end of learning, if every output of the model is completely equal
to the training data, the model can generate the sequence $Y$ using a feedback loop instead of using
the training data $X$ as external input. In this paper, open-loop dynamics refers to the case in which
external inputs are given by training data, and closed-loop dynamics refers to the case involving
self-feedback. In section 3, we consider the case in which the model is an autonomous system.
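The difference between open-loop and closed-loop dynamics can be sketched as two driver loops around a generic one-step model. Here `step` is a hypothetical callable that maps a state and an input to a new state and an output; it stands in for the trained model and is not part of the paper.

```python
def run_open_loop(step, state, xs):
    """Open loop: every input comes from the training data xs."""
    ys = []
    for x in xs:
        state, y = step(state, x)
        ys.append(y)
    return ys

def run_closed_loop(step, state, x_init, tau, length):
    """Closed loop with time delay tau.

    x_init holds the first tau inputs; after that, the input at time
    n + tau is the model's own output at time n.
    """
    assert len(x_init) == tau
    inputs = list(x_init)
    ys = []
    for n in range(length):
        state, y = step(state, inputs[n])
        ys.append(y)
        inputs.append(y)  # fed back as the input tau steps later
    return ys
```

In the closed-loop driver any prediction error is fed back into later inputs, which is why a low open-loop error does not by itself guarantee a low closed-loop error.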



3 Numerical simulation 
3.1 Learning 

The learning ability of the proposed learning method with optimized $\sigma$ is compared here with that of the
conventional method (constant $\sigma$) and a standard RNN model with BPTT [25] as a benchmark. The
architecture of the RNN with BPTT method is the same as that for a module of the mixture of RNN
experts model. The training data is a two-dimensional sequence generated by Markov chain switching
of 9 Lissajous curves, each of period 32 (see Figure 1). Assume that the transition probabilities of
the Markov chain are such that transitions among curves are consonant with orbit continuity. Since
we consider the model which has a feedback loop with time delay $\tau$, the training data set $D = (X, Y)$
satisfies $y_n = x_{n+\tau}$. We provide details of the training data in Appendix A.
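A data set of this kind can be imitated with the Python sketch below. It is only an illustration: the frequency pairs, the zero phase, and the uniform transition probabilities are assumptions chosen here so that every curve starts and ends at the origin, which keeps the orbit continuous at each switch. The actual curves and transition probabilities are those of the appendix, and the function names are hypothetical.

```python
import math
import random

def lissajous(a, b, period=32):
    """One period of a Lissajous curve with integer frequencies a and b.

    ASSUMPTION: zero phase, so the curve starts at the origin and returns
    to it after one period.
    """
    return [(math.sin(2.0 * math.pi * a * n / period),
             math.sin(2.0 * math.pi * b * n / period)) for n in range(period)]

def markov_switch(curves, transition, n_blocks, rng):
    """Concatenate whole periods, choosing the next curve by a Markov chain."""
    k = 0
    seq, labels = [], []
    for _ in range(n_blocks):
        seq.extend(curves[k])
        labels.extend([k] * len(curves[k]))
        k = rng.choices(range(len(curves)), weights=transition[k])[0]
    return seq, labels

# Nine hypothetical curves and a uniform transition matrix.
curves = [lissajous(a, b) for a in (1, 2, 3) for b in (1, 2, 3)]
transition = [[1.0] * 9 for _ in range(9)]
seq, labels = markov_switch(curves, transition, n_blocks=10, rng=random.Random(0))
```

The `labels` list records which curve generated each time step, which is useful later for checking whether the learned gate opening values segment the sequence at the true switching points.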

We describe the experimental conditions in this section. The training data was of length $T$ = 10,000,
and learning was conducted for 300,000 steps in each model. The parameter settings were
$N = 24$ learning modules, 10 context neurons in each learning module, a time constant
of $\epsilon = 0.1$, a time delay of $\tau = 5$ for the feedback loop, an infimum of $\underline{\sigma} = 0.05$ for the variance
of the normal distribution, a standard deviation of $\varsigma = 1$ for the prior distribution, a momentum of
$\eta = 0.9$, and a learning rate of $\alpha = 0.01/(Td)$, where $T$ is the length and $d$ is the dimension of the training
data. The learnable parameters of the mixture of RNN experts were initialized as $\beta_n = 0$ and $\sigma = 1$;
every element of the matrices $W_1^{(i)}, W_2^{(i)}, W_3^{(i)}$ and the vectors $v_1^{(i)}, v_2^{(i)}$ was initialized randomly from
the uniform distribution on the interval $(-0.1, 0.1)$, and the initial state $u_0^{(i)}$ for each $i$ was initialized
randomly from the interval $[-1, 1]$. In the RNN with BPTT method, the number of context neurons
was set to 240, corresponding to the total number of context neurons in the mixture of RNN experts
model, and the parameters of the RNN were initialized in the same manner as for a module of the
mixture of RNN experts model.

Figure 2: Mean square error for each learning step. (a) Open-loop dynamics. (b) Closed-loop
dynamics.

Figure 2 displays the mean square error for each learning step, as defined by

$$E = \frac{1}{T} \sum_{n=1}^{T} \bigl\| \hat{y}_n - y_n \bigr\|^2, \qquad (17)$$

where $\hat{y}_n$ and $y_n$ denote the training data and the output of the model, respectively. The error of
the trained model is evaluated by computing both the open-loop and closed-loop dynamics for each
learning step. The error for open-loop dynamics denotes the $\tau$-step prediction performance of the
model. As model learning reduces the error for open-loop dynamics by equation (15), this error
represents the progress of learning. However, from the point of view of temporal sequence learning, 
the error for closed-loop dynamics is a more important indicator of the performance in generating the 
desired sequences. In general, even if the error for open-loop dynamics is sufficiently low, the error 
for closed-loop dynamics may not necessarily be reduced because temporal evolution may increase 
the margin of error via the self-feedback loop. The reduction of error for closed-loop dynamics thus 
requires that the model successfully organize basins of attraction from the training data. 

In Figure 2, the error for open-loop dynamics for the RNN with BPTT is smaller than that for
the mixture of RNN experts under constant $\sigma$, whereas the error for closed-loop dynamics shows
the opposite result. This result implies that the RNN with BPTT is unable to construct basins of
attraction corresponding to the training data, and so the RNN cannot generate sequences similar to
the training data. This result also demonstrates the reduction in the error for both open-loop and
closed-loop dynamics by adaptively optimizing $\sigma$. The convergence of error in the proposed method
was also achieved faster than in the conventional constant $\sigma$ case. The results indicate that the
mixture of RNN experts with adaptive $\sigma$ tends to provide better performance than the conventional
method and the standard RNN with BPTT from the viewpoint of sequence prediction and generation.

Figure 3 indicates that the variances associated with many modules converged to the infimum $\underline{\sigma}$,
because there was no additional noise in the training data. Hence, it might be thought that the model
is capable of learning the data without optimization of $\sigma$, if $\sigma_i = \underline{\sigma}$ at the beginning. Training under
this condition, however, often fails. The reason is that the parameter $\vartheta_i$ of each learning module $i$
diverges, because of the huge gradient $\Delta\theta(t)$. These very large values of $\Delta\theta(t)$ occur
because, in the initial learning phases, the mean square error is large but the variance is very small.

Figure 3: The parameter $\sigma$ under adaptive optimization.

Figure 4: The number of elements in $Q$ for each learning step.



The probability $q(i, n)$ that learning module $i$ is selected at time $n$ is



Let us consider the function 

$$q_{\max}(n) = \operatorname*{argmax}_{1 \le i \le N} q(i, n) \qquad (19)$$

that denotes the index of the module maximizing the probability $q(i, n)$, and a set $Q$ defined by

$$Q = \{\, i \mid \exists n,\ q_{\max}(n) = i \,\}. \qquad (20)$$

$|Q|$, the number of elements in the set $Q$, indicates the number of modules which can be effectively
used to generate the sequence $Y$. In Figure 4 we have plotted $|Q|$ for each learning step. There
are two phases in the learning process. In the first phase $|Q|$ increases, and in the second phase it
decreases. This phenomenon often appears in the learning process when a mixture of RNN experts
is utilized. The reason for this behavior of $|Q|$ could be as follows: In the first phase, while the mean
square error is still large, the gradient $\Delta\theta(t)$ is directed to use learning modules to decrease the error;
in the second phase, when the error has become small, the effect of the prior distribution becomes
dominant and $|Q|$ decreases owing to suppression of the change of gate opening values. Furthermore,
in the case of adaptive $\sigma$, $|Q|$ does not decrease until $\sigma$ converges, because even if the error decreases,
the variance $\sigma$ also decreases by just that much, and therefore the values of the derivatives obtained
by equations (15) and (16) do not become small. Thus, if $\sigma$ is optimized, $|Q|$ decreases over the range
for which the size of the error has been sufficiently reduced. On the other hand, if $\sigma$ is constant,
then the error and $|Q|$ decrease together. However, $|Q|$ can often decrease at a large rate, with a
concomitant reduction in the rate at which the error decreases.
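Given the selection probabilities, equations (19) and (20) reduce to a few lines of Python. In this sketch `q` is assumed to be a list indexed as `q[n][i]` (time first, module second), and the function name is hypothetical.

```python
def effective_modules(q):
    """Return the winning module index per time step and the set Q.

    q[n][i] is the probability that module i is selected at time n.
    """
    q_max = [max(range(len(row)), key=row.__getitem__) for row in q]  # equation (19)
    Q = set(q_max)                                                    # equation (20)
    return q_max, Q

# Three modules over four time steps; module 1 never wins.
q = [[0.7, 0.2, 0.1],
     [0.6, 0.3, 0.1],
     [0.1, 0.1, 0.8],
     [0.2, 0.1, 0.7]]
q_max, Q = effective_modules(q)
print(q_max, len(Q))  # prints [0, 0, 2, 2] 2
```

A module that never attains the maximum at any time step is left out of $Q$, so `len(Q)` counts only the modules that actually take part in generating the sequence.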

Figure 5 displays a snapshot of the training data, output and gate opening values at the end of
learning. Although the training data is a 2-dimensional sequence, we have plotted only one dimension.

Figure 5: A snapshot of the training data, output and gate opening values at the end of learning. (a)
The training data. (b) The case in which $\sigma$ is optimized. (c) The case of constant $\sigma$. In (b) and
(c), the upper figures display the output of the trained models for the closed-loop dynamics, and the lower
figures display gate opening values, where the number over a gate opening value denotes the currently
open gate.



The result clearly shows that the trained model with adaptive $\sigma$ can reconstruct the spatio-temporal
patterns in the training data successfully by adaptively switching the gates, but this is not so for
the case with constant $\sigma$. Figure 6 (a) and (b) display the output of the trained model and the
output of the modules in the case of adaptive $\sigma$, respectively. Figure 6 (c) and (d) display these
outputs in the case of constant $\sigma$. It can be seen that the trained model successfully extracts the
Lissajous curves as primitives in the adaptive $\sigma$ case, whereas it could not extract the curves in the
constant $\sigma$ case. These results indicate that the model with adaptive $\sigma$ was able to segment the data
into blocks corresponding to Lissajous curve patterns, and could allocate self-organized patterns in
each corresponding module. The model with constant $\sigma$, however, was not able to perform this task
effectively.

3.2 Generalization 

Here we examine the generalization capability of the proposed method as compared with that of 
the conventional method. In order to evaluate this generalization capability, we prepared test data 
separate from the training data, and we regarded the mean square error for regenerating the test 
data as the generalization error. However, the gate opening values determined by the parameter $\beta_n$
correspond only to the training data. Thus, in computing the generalization error, we reconstruct
the $\beta_n$ that maximizes the likelihood for the test data by using the same learning scheme. Note that
the parameter $\vartheta_i$ of each learning module and the variance $\sigma$ do not change in the reconstruction.
The aim of this analysis is to evaluate the general feasibility of the segmentation and data allocation
scheme based on a gating mechanism. If the gating mechanism successfully segments the training data
into blocks as reusable primitives, and each module acquires a rule for allocating blocks in the training
phase, not only the training error but also the generalization error will be small. The value of $|Q|$ can
also be regarded as a measure of generalization, representing the number of primitives extracted from
the observed data.

Figure 6: Trajectories generated by trained models in the closed-loop dynamics. (a) and (b)
display the output of the trained model and the output of the modules in the case of adaptive $\sigma$, respectively.
(c) and (d) display these outputs in the case of constant $\sigma$. Notice that the output of a module
$i$ is plotted if $q_{\max}(n) = i$, namely, if gate $i$ opens at time $n$. If gate $i$ never opened, then the plot for
module $i$ is omitted.

To explain the practical meaning of the generalization error and $|Q|$ more clearly, let us consider
the case in which a gating network, which predicts or generates gate opening values, is added to
the proposed model. Although the definition of the model in section 2 does not include the gating
network, because the focus there is on the segmentation of temporal sequences, the gating network is
usually used together with experts when the model is applied to realistic problems. In this case, if the generalization
error of experts is sufficiently reduced, the generalization error of the whole system can be reduced 
by decreasing that of the gating network. It is therefore sufficient to discuss the generalization ability 
of a single network when evaluating the generalization performance of the whole system, and the 
generalization ability of the single network can be evaluated by the stochastic approach H]. Hence, 
the model can be considered to have good generalization performance if it minimizes the generalization 
error. Furthermore, training of the gating network tends to be easier when $|Q|$ is small. Accordingly,
the model can be understood to have good generalization performance if the scheme minimizes both
the generalization error and $|Q|$.

This experiment serves to compare the case of adaptive $\sigma$ with the constant cases in which $\sigma = 1$
and $\sigma = \underline{\sigma}$ (= 0.05). Moreover, we can also compare them with the learning of an RNN using BPTT,
where the architecture of the RNN is that of a mixture of RNN experts. As training data and test data,
we used a 2-dimensional sequence generated by Markov chain switching of 2 Lissajous curves of period
32, with a time delay of the feedback loop of $\tau = 1$. The lengths of the training and test data were 1000
and 2000, respectively. In the case of optimized $\sigma$, we initialized $\sigma$ to 5. Other parameter settings were
the same as for Experiment 1.

The generalization error and $|Q|$ after 100,000 learning steps are displayed in Figure 7. The results
are shown for 10 samples with different initial conditions for each $N$, the number of learning modules.
The generalization error for closed-loop dynamics is lower for the case of optimized $\sigma$ than for the
other cases, and $|Q|$ for the optimized $\sigma$ is as low as that achieved by any of the other cases. In the
case of $\sigma = \underline{\sigma}$, even though the generalization error decreases with increasing $N$ to match that of the
adaptive $\sigma$ case, $|Q|$ increases markedly with $N$. In the case of $\sigma = 1$, neither the generalization error
nor $|Q|$ depends on the number of modules, but the generalization error is greater than that in
the adaptive $\sigma$ case. These results suggest that only the generalization error or $|Q|$, not both, can be
reduced if $\sigma$ is constant. Thus, the model with constant $\sigma$ is unable to continue performing effectively
as the number of modules increases. On the other hand, the adaptive optimization of $\sigma$ combines the
positive characteristics of both the $\sigma = 1$ and $\sigma = \underline{\sigma}$ cases by reducing both the generalization error and
$|Q|$ even at large $N$. For the reason explained in section 3.1, by optimizing $\sigma$ in the learning
phase, $|Q|$ is low over the range in which the error margins are sufficiently reduced. These results thus
suggest that the proposed learning method for the mixture of RNN experts model is scalable with
respect to the number of learning modules, maintaining good performance in terms of generalization
error with increasing $N$.

Figure 8 shows the effects of the parameter $\varsigma$ of the prior distribution on generalization in learning.
The simulation computed up to 100,000 learning steps for each value of $\varsigma$, where the
number of modules $N$ was set at 16. The computation was repeated 10 times for each $\varsigma$. By
equation (13), the effect of suppressing gate opening change is inversely proportional to $\varsigma^2$. If the change
of gate opening is suppressed to an extreme degree, the model cannot learn the training data, and
the generalization error is increased. On the other hand, if the change of gate opening values is not
suppressed, then $|Q|$ is not reduced. From Figure 8 (b), if $\sigma$ is constant, the dependence of $|Q|$ on the
parameter $\varsigma$ is stronger than in the case of adaptive $\sigma$. As a result, in the case of optimized $\sigma$, the
range of $\varsigma$ for which both the generalization error and $|Q|$ are minimized is larger than
for the case of constant $\sigma$. Therefore, it can be said that the proposed method can achieve better
generalization in learning more stably than the conventional one.

Figure 7: The generalization error and $|Q|$ after 100,000 learning steps for each value of the parameter
$N$, the number of learning modules. (a) The generalization error for the closed-loop dynamics. (b)
The number of elements in the set $Q$. In the case of the RNN using BPTT, the number of context
neurons in the RNN is set to $10N$, that is, the total number of context neurons in the mixture of
RNN experts. For each parameter $N$, we computed the results for 10 samples with different initial
conditions, training data and test data.

Figure 8: (a) The generalization error for the closed-loop dynamics. (b) The number of elements in the
set $Q$ for the test data. For each parameter $\varsigma$, we computed the results for 10 samples up to 100,000
learning steps, where the number of learning modules is $N = 16$.

Figure 9: Humanoid robot behavior



3.3 Practical application 

The practical learning performance of the proposed method is evaluated by applying the scheme to
the learning of sensory-motor flows for a small humanoid robot. The setup comprises a robot placed
before a workbench on which a cubic object is placed. The task for the robot is to autonomously
generate desired behaviors consisting of several primitive motions (see Figure 9): (1) reaching for
the object, (2) moving the object up and down, (3) moving the object left and right, (4) moving the
object forward and backward, (5) touching the object with the left and right hand alternately, and (6)
touching the object with both hands. For each behavior, the robot begins from a home position and
ends at the home position. Each behavior is described by a 10-dimensional sensory-motor flow,
which consists of an 8-dimensional motor vector representing the arm joint angles, and a 2-dimensional
vision sense representing the object position. The task is to predict and generate sensory-motor
sequences.

In this section, a gating network to predict or generate gate opening values is added to the mixture
of RNN experts model in order to realize a practically applicable scheme. In the analyses above, gate
opening values were determined by the parameter $\beta_n$, because the main focus of this paper is the
learning process to segment sequence data into primitives with respect to spatio-temporal patterns.
Since segmenting sequences is one of the most difficult processes in the learning of the mixture of
experts model, focusing on the data segmentation is an efficient way to investigate the learning
process. However, such a model is unable to predict or generate unknown time series. To apply the
model to an actual problem, the model must be extended to a fully autonomous sequence generator
by adding a gating network for module switching. Thus, we consider the gating network learning in
this section. Learning for the gating network can be performed by adopting the gate opening vector
given by $\beta_n$ as a target. The definitions of the gating network and the corresponding learning method
are described in Appendix B, and further details can be found in [20].

The training data consists of 3 kinds of behaviors as depicted in Figure 9 and 3 initial locations
of the object for each behavior, yielding a total of 9 training sequences. The model is constructed
with $N = 16$ experts, 20 context neurons for each expert, and 30 context neurons for the gating
network. The time constant and time delay of the feedback loop are set to $\epsilon = 0.05$ and $\tau = 3$,
respectively. Other parameters are the same as in section 3.1, and the parameters of the gating
network are initialized in the same way as for a module of the mixture of RNN experts.

Figure 10 shows the mean square error for the closed-loop dynamics after training the mixture 
of RNN experts model up to 100,000 steps and the gating network up to 50,000 steps. Figure 11 
describes the training data, output, and gate opening values of the model trained with the gating 
network. It can be seen that the model can autonomously generate sequences similar to the training 
data if $\sigma$ is optimized by using our method, whereas the trained model is unable to generate similar 
sequences if $\sigma$ is constant. These results demonstrate that the proposed learning method is effective 
in application to realistic problems of time series prediction and generation, even when employing a 
gating network. 

Figure 10: Mean square error for closed-loop dynamics for learning of humanoid robot tasks. (a) 
Learning for expert modules. (b) Learning for a gating network to generate $g_n$ in computation of 
closed-loop dynamics. 



4 Discussion 

4.1 Segmentation of temporal time series caused by indeterminacy 

One of the most important problems in the learning of a mixture of RNN experts model concerns 
the segmentation of sequence data into blocks as reusable primitives. By the definition of the 
likelihood function, the module with the smallest error margin is usually selected by the gating mechanism. 
Moreover, as the prior distribution given by equation (10) has the effect of suppressing the change in 
gate opening values, the gating mechanism retains the selected module until the module is no longer 
able to correctly predict sequence data. Thus, locally represented knowledge of a reusable primitive 
arises from the indeterminacy of the sequence data, and is embedded into each module. Indeed, it 
was shown above that a mixture of RNN experts model is able to learn to generate data constructed 
by stochastically combining 9 Lissajous curves. Figure 6(b) indicates that the adaptive optimization 
of $\sigma$ allows a set of reusable chunks corresponding to the Lissajous curve patterns to be successfully 
extracted, and self-organized chunks to be allocated in each corresponding module. 

The segmentation mechanism caused by the indeterminacy has also been discussed in a study 
by Tani and Nolfi [26]. In their study, involving a robot actively exploring two rooms connected 
by a door, the sensory-motor flow was segmented by means of the uncertainty of the door opening. 
Similarly, for stochastic switching between Lissajous curves or the arbitrary composition of robot 
motions in the present study, both of which involve indeterminacy in the observable data, the proposed 
learning method allows the mixture of RNN experts model to successfully segment the data using 
the information of nondeterministic switching. The present scenarios thus reproduce the essential 
characteristics of the scenario considered by Tani and Nolfi in terms of the segmentation mechanism. 

Figure 11: Time series of motor vector and gate opening vector. Output $y_n$ and gate opening vector 
$g_n$ of the trained model are computed in closed-loop dynamics. For each time series, only the initial state 
of the model differs. 






4.2 Dynamic change of functions 



As training data in section 3, we considered Markov chain switching of Lissajous curves. Switching 
patterns governed by a Markov chain belong to a simple class of systems in which a function changes 
dynamically. In addition to such Markov chain models, there are other systems utilizing dynamic 
functions. For instance, a switching map system is a dynamical system containing several maps, in which 
the maps governing the temporal evolution of the system are dynamically switched with other maps in the 
system [23, 19]. As other examples, we can consider chaotic itinerancy and function dynamics. Chaotic 
itinerancy is a class of phenomena such as chaotic itinerant motion among varieties of ordered states 
[7, 14], and a function dynamics [15, 16, 17] is a dynamical system on a 1-dimensional function space. 
It is still unclear whether these systems can be learned by the model using the proposed method, 
either because the primitive sequences are governed by chaotic dynamics, or because the rule governing the 
change of primitives has a long-term dependency more complex than that of a Markov chain. Further 
analysis to explain the learnability of the model is a topic for future research. 



5 Conclusion 

In this study, we have presented a novel method of adaptive variance applied to learning for a mixture 
of RNN experts. To show the capability of the proposed method, we presented a learning experiment 
in which the model using the proposed method learns to correctly segment a continuous flow of 
training data into primitives, whereas the model using the conventional method cannot. The 
generalization capability has also been examined in order to 
compare the proposed method with the conventional method. From these results, we claim that our 
method improves the learning performance of the model. Furthermore, we have presented a humanoid 
robot experiment to confirm the potential of our method. This experiment suggests that 
the model utilizing our method can be applied to realistic problems of time series prediction and 
generation. 



A Training data: Markov chain switching among nine Lissajous curves 

The training data in section 3.1 comprise a 2-dimensional sequence generated by Markov chain switching 
of 9 Lissajous curves (see Figure 1). The equation for Lissajous curve $i$ is given by 

$$x_{n,1} = A_i \cos(a_i n) + B_i, \tag{21}$$
$$x_{n,2} = C_i \sin(b_i n + c_i) + D_i, \tag{22}$$


and the parameters $a_i, b_i, c_i, A_i, B_i, C_i, D_i$ are determined such that the period of each Lissajous curve 
is 32. The transition probability $R$ governing the change between Lissajous curves is defined by 



R = 
(23) 



and it is assumed that the transition among curves is consonant with continuity of the orbit. 
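
The construction described in this appendix can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' data generator: the helper names (`lissajous_point`, `make_curve`, `generate`), the amplitude and offset values, the choice of integer frequency multipliers to obtain period 32, and the uniform transition matrix used in the usage note are all hypothetical. Switching happens once per period at n = 0, and all curves share the same offsets and phase so that they pass through a common point there, which is one simple way to keep the orbit continuous.

```python
import math
import random

PERIOD = 32  # period of every curve, as stated in the text


def lissajous_point(n, a, b, c, A, B, C, D):
    """Return (x_{n,1}, x_{n,2}) following equations (21) and (22)."""
    x1 = A * math.cos(a * n) + B
    x2 = C * math.sin(b * n + c) + D
    return x1, x2


def make_curve(k1, k2, c=0.0, A=0.4, B=0.0, C=0.4, D=0.0):
    """Hypothetical parameter choice: integer k1, k2 make the curve repeat every 32 steps."""
    a = 2.0 * math.pi * k1 / PERIOD
    b = 2.0 * math.pi * k2 / PERIOD
    return (a, b, c, A, B, C, D)


def generate(curves, R, n_periods, seed=0):
    """Emit one full period of the current curve, then switch according to row R[state]."""
    rng = random.Random(seed)
    state, xs, states = 0, [], []
    for _ in range(n_periods):
        for n in range(PERIOD):
            xs.append(lissajous_point(n, *curves[state]))
            states.append(state)
        # Markov chain transition: the next curve depends only on the current one.
        state = rng.choices(range(len(curves)), weights=R[state])[0]
    return xs, states
```

For example, `generate([make_curve(1, 1), make_curve(1, 2), make_curve(2, 1)], [[1/3] * 3 for _ in range(3)], 4)` yields 128 points together with the index of the curve active at each step; the matrix used here is a placeholder and not the $R$ of equation (23).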
B Learning process of the gating network 



We present the definition of a gating network together with the training procedure. The dynamics of 
the gating network are defined by 

$$u_n^g = (1 - \varepsilon^g)\, u_{n-1}^g + \varepsilon^g \left( W_1^g g_{n-\tau} + W_2^g x_{n-\tau} + W_3^g c_{n-1}^g + v_1^g \right), \tag{24}$$
$$c_n^g = \tanh(u_n^g), \tag{25}$$
$$b_n = W_4^g c_n^g + v_2^g, \tag{26}$$
$$g_n^i = \frac{\exp(b_n^i)}{\sum_{k=1}^{N} \exp(b_n^k)}, \tag{27}$$

where $x_n$, $c_n^g$ and $g_n$ denote input states, context states and output states, respectively. In addition, 
$u_n^g$ and $b_n$ are the internal potentials of neurons, and $\tau$ represents the time delay of feedback. The 
output $g_n$ of the gating network represents the gate opening vector at time $n$. To discriminate between 
the gate opening vector determined by parameter $\beta_n$ and the output of the gating network, in this 
appendix the gate opening vector determined by $\beta_n$ is denoted as $\hat{g}_n$, and is used as an estimate of 
the gate opening vector. 
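
A single update of the gating network can be sketched as follows. The sketch assumes one particular reading of equations (24) to (27): the context potential is a leaky integration, with rate $\varepsilon^g$, of a weighted sum of the delayed gate vector, the delayed input, the previous context state and a bias, and the output is a softmax over the output potentials. The function names, the pairing of each weight matrix with a term, and the use of plain Python lists are illustrative assumptions rather than the authors' implementation.

```python
import math


def matvec(W, v):
    """Multiply a matrix given as a list of rows by a vector."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def softmax(b):
    """Numerically stable softmax, corresponding to equation (27)."""
    m = max(b)
    e = [math.exp(x - m) for x in b]
    s = sum(e)
    return [x / s for x in e]


def gating_step(u_prev, g_delayed, x_delayed, params, eps):
    """One time step: returns the new context potential u_n and the gate vector g_n.

    g_delayed and x_delayed stand for g_{n-tau} and x_{n-tau}; during training
    with teacher forcing, g_delayed would be the target estimate instead.
    """
    W1, W2, W3, W4, v1, v2 = params
    c_prev = [math.tanh(u) for u in u_prev]              # c_{n-1} = tanh(u_{n-1})
    drive = [p + q + r + s for p, q, r, s in zip(
        matvec(W1, g_delayed), matvec(W2, x_delayed), matvec(W3, c_prev), v1)]
    u = [(1.0 - eps) * up + eps * d for up, d in zip(u_prev, drive)]  # leaky integration
    c = [math.tanh(ui) for ui in u]                      # context state
    b = [bi + vi for bi, vi in zip(matvec(W4, c), v2)]   # output potential
    return u, softmax(b)
```

Because the output passes through a softmax, the components of the returned gate vector are positive and sum to one, so each component can be read as the probability of selecting the corresponding expert.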

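Let $\theta^g = (W_1^g, W_2^g, W_3^g, W_4^g, v_1^g, v_2^g, u_0^g)$ be a set of learnable parameters of a gating network. 
Assume that $D^g = (X, (\hat{g}_n)_{n=1}^{T})$ is given. We define a likelihood $L(D^g, \theta^g)$ by the Dirichlet distribution 

$$P(\hat{g}_n \mid g_n) = \frac{1}{Z(\hat{g}_n)} \prod_{i=1}^{N} (g_n^i)^{\hat{g}_n^i}, \tag{28}$$
$$L(D^g, \theta^g) = \prod_{n=1}^{T} P(\hat{g}_n \mid g_n), \tag{29}$$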

where $Z(\hat{g}_n)$ is the normalization constant, and $g_n$ is given by $X$ and $\theta^g$. Notice that $g_n$ in equation 
(29) is calculated by the teacher forcing technique [27], namely, using $\hat{g}_{n-\tau}$ instead of $g_{n-\tau}$ in equation 
(24). In addition, because of the term $x_{n-\tau}$ on the right-hand side of equation (24), $g_n$ cannot be 
computed if $n < \tau$. It is therefore assumed that $g_n = \hat{g}_n$ if $n < \tau$. If $g_n^i$ is considered to be the 
probability of choosing an expert $i$ at time $n$, maximizing $\ln L(D^g, \theta^g)$ is equivalent to minimizing the 
Kullback-Leibler divergence 

$$D_{\mathrm{KL}}\left( (\hat{g}_n)_{n=1}^{T} \,\|\, (g_n)_{n=1}^{T} \right) = \sum_{n=1}^{T} \sum_{i=1}^{N} \hat{g}_n^i \ln \frac{\hat{g}_n^i}{g_n^i}. \tag{30}$$
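
The equivalence stated above can be checked numerically. The sketch below assumes that the part of the log-likelihood depending on the gating output is the double sum of target components times the logarithm of output components, with the normalization constant depending only on the targets; the function names are hypothetical. Under that assumption, the Kullback-Leibler divergence and this sum always add up to a quantity that depends only on the targets, so increasing one necessarily decreases the other.

```python
import math


def log_likelihood_core(g_hat, g):
    """Sum over n and i of g_hat[n][i] * ln g[n][i] (ln L without the normalization terms)."""
    return sum(t * math.log(p)
               for th, ph in zip(g_hat, g)
               for t, p in zip(th, ph) if t > 0)


def kl_divergence(g_hat, g):
    """Sum over n and i of g_hat[n][i] * ln(g_hat[n][i] / g[n][i])."""
    return sum(t * math.log(t / p)
               for th, ph in zip(g_hat, g)
               for t, p in zip(th, ph) if t > 0)
```

Terms with a zero target component are skipped, following the usual convention that they contribute nothing to either sum.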

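The learning process for the gating network involves choosing the best parameter $\theta^g$ by using the 
gradient descent method for the likelihood $L(D^g, \theta^g)$, in the same way as in equations (11) and (12). 
The partial derivative $\partial \ln L(D^g, \theta^g) / \partial \theta^g$ is given by 

$$\frac{\partial \ln L(D^g, \theta^g)}{\partial \theta^g} = \sum_{n=1}^{T} \sum_{i=1}^{N} \frac{\hat{g}_n^i}{g_n^i} \frac{\partial g_n^i}{\partial \theta^g}. \tag{31}$$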

Note that $\partial g_n^i / \partial \theta^g$ can be computed by the BPTT method. The procedure for training the model consisting 
of the mixture of RNN experts and gating network is organized into two phases as follows: 

(1) We train the experts, obtaining an estimate $\hat{g}_n$ of the gate opening vector. 

(2) Using the estimate $\hat{g}_n$ of the gate opening vector as a target, we train the gating 
network. 

The training procedure proceeds to phase (2) after convergence of phase (1). 
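
For phase (2), the gradient of the log-likelihood with respect to the output potentials takes a simple form, which is where back-propagation through time would start. The sketch below assumes a softmax output layer and a single time step; the function names are hypothetical. Applying the chain rule to the ratio of target to output components, together with the softmax derivative, gives the target minus the output whenever the target components sum to one. A finite-difference comparison is one way to confirm this.

```python
import math


def softmax(b):
    """Numerically stable softmax over the output potentials."""
    m = max(b)
    e = [math.exp(x - m) for x in b]
    s = sum(e)
    return [x / s for x in e]


def log_lik(g_hat, b):
    """Sum over i of g_hat[i] * ln g[i] with g = softmax(b), for one time step."""
    return sum(t * math.log(p) for t, p in zip(g_hat, softmax(b)))


def grad_wrt_b(g_hat, b):
    """Analytic gradient: sum_i (g_hat[i] / g[i]) * dg[i]/db[j] = g_hat[j] - g[j] * sum(g_hat)."""
    g = softmax(b)
    total = sum(g_hat)
    return [t - p * total for t, p in zip(g_hat, g)]
```

Calling `grad_wrt_b([0.7, 0.2, 0.1], [0.3, -1.0, 0.5])` returns a vector whose components sum to zero up to rounding, reflecting the fact that shifting all potentials by the same constant leaves the softmax output unchanged.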